
Conversation

@julienrbrt
Member

@julienrbrt julienrbrt commented Feb 10, 2026

Overview

Skip draining when exec client unavailable.
Finally put my latest read into practice (https://go.dev/blog/synctest), as it was relevant here.

@github-actions
Contributor

github-actions bot commented Feb 10, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build      Format      Lint       Breaking   Updated (UTC)
✅ passed  ⏩ skipped  ✅ passed  ✅ passed  Feb 10, 2026, 11:02 AM

@claude
Contributor

claude bot commented Feb 10, 2026

Claude finished @julienrbrt's task in 1m 50s.


Code Review Complete

I've performed a comprehensive review of the syncing shutdown fix. Here's my detailed analysis:


Summary

This PR implements a fix to prevent the syncer from attempting to drain pending events during shutdown when a critical error (execution client unavailable) has occurred. The implementation is sound and well-tested, with good use of atomic operations and the new testing/synctest package.


Positive Aspects ✅

1. Clean Error Handling Architecture

  • The introduction of hasCriticalError atomic.Bool is a clean, race-free way to track critical failures
  • The sendCriticalError() function properly sets the flag and attempts a non-blocking channel send (sketched after this list)
  • Error wrapping with context is consistent (fmt.Errorf("failed to execute transactions: %w", err))
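
A minimal sketch of this pattern, assuming a simplified struct layout (the real Syncer has more fields; only hasCriticalError and the error channel come from the review, and the field name errorCh is an assumption):

import (
    "sync/atomic"
)

type Syncer struct {
    hasCriticalError atomic.Bool
    errorCh          chan error
}

// sendCriticalError marks the syncer as failed and forwards the error
// without ever blocking the caller.
func (s *Syncer) sendCriticalError(err error) {
    s.hasCriticalError.Store(true) // set the flag before signalling
    select {
    case s.errorCh <- err: // deliver if a receiver is ready
    default: // otherwise drop: the flag alone is enough for the shutdown logic
    }
}

Because Store and Load are atomic, concurrent callers need no mutex, and Stop() can read the flag from any goroutine.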

2. Excellent Test Coverage

  • Three new test cases covering the critical paths:
    • TestProcessHeightEvent_ExecutionFailure_DoesNotReschedule - verifies events aren't re-queued after execution failure
    • TestSyncer_Stop_SkipsDrainOnCriticalError - confirms drain is skipped when critical error flag is set
    • TestSyncer_Stop_DrainWorksWithoutCriticalError - ensures normal drain still works
  • Good use of synctest.Test() wrapper for time-dependent tests (lines 543, 806, 945 in syncer_test.go); a sketch of the pattern follows this list
  • All existing tests updated to use synctest where appropriate
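
A minimal sketch of the synctest pattern referenced above (the test name and durations are illustrative, not taken from the PR):

import (
    "testing"
    "testing/synctest"
    "time"
)

func TestDrainFinishesBeforeTimeout(t *testing.T) {
    synctest.Test(t, func(t *testing.T) {
        done := make(chan struct{})
        go func() {
            time.Sleep(5 * time.Second) // fake clock: advances instantly once all goroutines block
            close(done)
        }()
        select {
        case <-done:
            // drain completed within the window
        case <-time.After(10 * time.Second):
            t.Fatal("drain did not finish before the timeout")
        }
    })
}

Inside the synctest bubble time is virtual, so this completes in microseconds of real time while still exercising the 5-second path deterministically.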

3. Concurrency Safety

  • Proper use of atomic.Bool for the critical error flag (syncer.go:121)
  • No race conditions in the shutdown logic
  • The conditional drain check (syncer.go:260) is thread-safe

4. Graceful Degradation

  • The 5-second drain timeout is preserved for normal shutdown (syncer.go:261); see the sketch after this list
  • Events that fail validation are still properly categorized (syncer.go:688-698)
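
A sketch of the conditional drain as the review describes it, building on the earlier Syncer sketch; heightInCh and processHeightEvent are named in this thread, but the helper name and both signatures are assumptions:

import (
    "context"
    "time"
)

func (s *Syncer) drainPendingEvents() {
    if s.hasCriticalError.Load() {
        return // execution client is unavailable; pending events would only fail again
    }
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    for {
        select {
        case event := <-s.heightInCh:
            s.processHeightEvent(ctx, event) // hypothetical signature
        case <-ctx.Done():
            return // drain window expired
        default:
            return // channel empty, drain complete
        }
    }
}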

Issues & Concerns ⚠️

1. CRITICAL: Removal of Shutdown Timeout Safety Net

Location: pkg/cmd/run_node.go:226-235

The PR removes the 5-second timeout from the shutdown wait logic:

// Before (implicit from removed code):
case <-time.After(5 * time.Second):
    // ... timeout handling

// After:
case <-time.After(10 * time.Second):
    return fmt.Errorf("shutdown timeout exceeded")

On closer inspection, the change replaces the implicit handling with an explicit 10-second timeout. The remaining issue is that this timeout only covers the node shutdown wait, not all components.

Concern: While the syncer's drain logic now intelligently skips when there's a critical error, other components might still hang during shutdown. The gemini-code-assist reviewer correctly flagged this.

Recommendation: Consider adding a fallback timeout mechanism or ensure all components have their own shutdown timeouts. The 10-second timeout is reasonable, but document why this value was chosen.


2. Incomplete Error Classification

Location: block/internal/syncing/syncer.go:688-698

The logic that decides whether to reschedule events has this pattern:

switch {
case errors.Is(err, errInvalidBlock) || s.hasCriticalError.Load():
    // do not reschedule: the block is invalid, or a critical error already occurred
case errors.Is(err, errMaliciousProposer):
    s.sendCriticalError(...) // unrecoverable: surface to the caller
case errors.Is(err, errInvalidState):
    s.sendCriticalError(...) // unrecoverable: surface to the caller
default:
    s.cache.SetPendingEvent(height, event) // transient failure: retry later
}

Issue: The code checks hasCriticalError.Load() AFTER TrySyncNextBlock returns, but sendCriticalError is called inside ApplyBlock (line 818). This means:

  1. ApplyBlock fails and calls sendCriticalError (sets flag to true)
  2. Control returns to processHeightEvent line 681
  3. The error check on line 688 sees hasCriticalError.Load() == true
  4. Event is not rescheduled

This works correctly, but the logic could be clearer. The flag is being used both to skip draining AND to prevent rescheduling.

Minor suggestion: Add a comment explaining that hasCriticalError is checked here because it may have been set during the TrySyncNextBlock call.

3. Missing Error Type Definition

Location: block/internal/syncing/syncer.go:703-708

The comment in the overview mentions introducing errExecutionClientFailure, but I don't see this error defined in the code. Looking at the logic:

var (
    errInvalidBlock = errors.New("invalid block")
    errInvalidState = errors.New("invalid state")
)

There's no errExecutionClientFailure sentinel error. This contradicts the PR description.

However, looking at the implementation, the code doesn't need a specific error type because it uses the hasCriticalError flag instead. This is actually cleaner than wrapping errors.

Recommendation: Update the PR description to remove the mention of errExecutionClientFailure since it wasn't implemented (or clarify why it wasn't needed).

4. Test Coverage Gap

Coverage: 62.50% patch coverage with 9 uncovered lines (per the Codecov report below)

Looking at the test file, the new tests are excellent, but there are some edge cases not covered:

  • What happens if sendCriticalError is called multiple times concurrently?
  • What if the drain context times out while processing an event?
  • What if Stop() is called multiple times?

These are edge cases that might not be worth testing, but they're worth considering for robustness; a sketch of a test for the concurrent-call case follows.
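
A hypothetical test for the first edge case, reusing the sendCriticalError sketch from earlier (the buffer size and iteration count are arbitrary):

import (
    "errors"
    "sync"
    "testing"
)

func TestSendCriticalError_ConcurrentCalls(t *testing.T) {
    s := &Syncer{errorCh: make(chan error, 1)}
    var wg sync.WaitGroup
    for i := 0; i < 16; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            s.sendCriticalError(errors.New("exec client down"))
        }()
    }
    wg.Wait()
    if !s.hasCriticalError.Load() {
        t.Fatal("flag must be set after any call")
    }
    // only one error fits the buffer; the rest are dropped by the non-blocking send
}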

The fact that pkg/cmd/run_node.go has 0% coverage for the changed lines is expected (integration-level code).


Code Quality Assessment

Architecture ⭐⭐⭐⭐⭐

  • Clean separation of concerns
  • Good use of atomic operations for shared state
  • Proper error propagation through channels

Concurrency Safety ⭐⭐⭐⭐⭐

  • All shared state properly protected with atomics
  • No data races (atomic.Bool, atomic.Uint64, atomic.Pointer)
  • Proper context cancellation handling

Error Handling ⭐⭐⭐⭐

  • Consistent error wrapping with %w
  • Good distinction between recoverable and critical errors
  • Minor: Could benefit from more specific error types (but current approach is pragmatic)

Testing ⭐⭐⭐⭐⭐

  • Comprehensive test coverage of the critical path
  • Good use of synctest for time-dependent tests
  • Tests verify both positive and negative cases

Documentation ⭐⭐⭐

  • Good inline comments explaining the drain skip logic (syncer.go:258-259)
  • Changelog entry is clear
  • Missing: godoc for hasCriticalError field explaining its purpose

Go Best Practices Review

✅ Follows Best Practices:

  1. Atomic operations: Correct use of atomic.Bool for flag (no mutexes needed for simple boolean)
  2. Error handling: Proper error wrapping with %w
  3. Testing: Table-driven tests with synctest for deterministic time-based testing
  4. Context handling: Proper context propagation and cancellation
  5. Channel operations: Non-blocking send to error channel (line 1143)

⚠️ Minor Observations:

  1. Magic numbers: The 5-second drain timeout (line 261) could be a named constant
  2. Context cleanup: The drain context (line 261-262) is properly cleaned up with defer
  3. Nil checks: Proper nil check for cancel function (line 250)

Security Considerations

✅ No Security Issues Found

  • No credential exposure in logs
  • Proper error message sanitization
  • No potential for deadlocks in the new code
  • Atomic operations prevent data races

Performance Considerations

✅ Performance Impact: Minimal

  • Atomic operations are very fast (single CPU instruction)
  • Skip draining on critical error actually improves shutdown performance
  • No additional goroutines or allocations introduced

Recommendations

High Priority:

  1. Document the shutdown timeout decision - Add a comment explaining why 10 seconds is sufficient or adjust based on observed component shutdown times

Medium Priority:

  1. Add godoc for hasCriticalError - Explain it's used both for skipping drain and preventing event rescheduling
  2. Update PR description - Remove mention of errExecutionClientFailure or clarify why it wasn't implemented as a sentinel error

Low Priority:

  1. Named constant for drain timeout - Make 5*time.Second a named constant like drainTimeout
  2. Consider additional edge case tests - Concurrent sendCriticalError calls, multiple Stop() calls

Final Verdict: LGTM with Minor Suggestions

This is a well-implemented fix that solves the shutdown hanging problem elegantly. The use of synctest from the Go blog post (https://go.dev/blog/synctest) shows good awareness of modern Go testing patterns. The test coverage is comprehensive for the critical paths.

The main concern raised by the gemini-code-assist reviewer about removing the timeout is valid, but the code actually increases the timeout to 10 seconds and makes it explicit, which is reasonable. However, verifying that all components shut down within this window would be prudent.

The code is production-ready with the minor documentation improvements suggested above.


Branch: julien/fix-shutdown

@gemini-code-assist
Contributor

Summary of Changes

Hello @julienrbrt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the Syncer component's shutdown process by introducing a mechanism to gracefully handle critical errors, specifically when the execution client becomes unavailable. By tracking a hasCriticalError flag, the system can now intelligently skip the draining of pending events during shutdown, thereby preventing redundant and failing operations and ensuring a more responsive and efficient termination in error states. This change improves the overall robustness and user experience during node shutdowns under adverse conditions.

Highlights

  • Conditional Drain on Shutdown: The Syncer.Stop() method now conditionally skips draining the heightInCh event channel if a critical error, such as an unavailable execution client, has occurred. This prevents unnecessary processing of events that are likely to fail during shutdown.
  • Critical Error Tracking: An atomic.Bool field, hasCriticalError, was added to the Syncer struct to track whether a critical error has been encountered. This flag is set when sendCriticalError is called.
  • New Error Type for Execution Client Failure: A new error, errExecutionClientFailure, was introduced to specifically identify issues where the execution client is unavailable after all retries. This error is used to trigger the conditional draining logic.
  • Improved Error Handling: The Syncer.ApplyBlock method now wraps execution client errors with errExecutionClientFailure and sets the hasCriticalError flag. Additionally, Syncer.processHeightEvent was updated to avoid rescheduling events if errExecutionClientFailure is detected.
  • Enhanced Testing: New test cases were added to verify that events are not rescheduled after execution client failures and that the Syncer.Stop() method correctly skips draining when a critical error is present, while also confirming normal draining behavior in its absence.
  • Removed Fixed Shutdown Timeout: A fixed 5-second timeout was removed from the node's shutdown wait logic in pkg/cmd/run_node.go, as the Syncer's shutdown behavior is now more intelligently handled.


Changelog
  • block/internal/syncing/syncer.go
    • Added hasCriticalError atomic boolean field to Syncer struct.
    • Modified Stop() method to check hasCriticalError before initiating the drain loop for heightInCh.
    • Defined a new error errExecutionClientFailure to signify an unavailable execution client after retries.
    • Updated processHeightEvent to prevent rescheduling of events if errExecutionClientFailure is encountered.
    • Modified ApplyBlock to wrap execution client errors with errExecutionClientFailure and set hasCriticalError.
    • Ensured sendCriticalError sets the hasCriticalError flag.
  • block/internal/syncing/syncer_test.go
    • Imported testing/synctest for improved test context management.
    • Wrapped TestSyncer_executeTxsWithRetry with synctest.Test.
    • Added TestProcessHeightEvent_ExecutionFailure_DoesNotReschedule to verify non-rescheduling of events on execution failure.
    • Added TestSyncer_Stop_SkipsDrainOnCriticalError to confirm drain skipping during critical errors.
    • Added TestSyncer_Stop_DrainWorksWithoutCriticalError as a sanity check for normal drain operation.
  • pkg/cmd/run_node.go
    • Removed a hardcoded 5-second timeout from the node's shutdown wait select statement.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fix to prevent the syncer from attempting to drain pending events during shutdown when a critical error, such as an unavailable execution client, has occurred. This is achieved by adding an atomic.Bool flag hasCriticalError to the Syncer. This flag is set when a critical error is reported, and checked within the Stop() method to conditionally skip the draining logic. The changes also ensure that events failing due to execution client unavailability are not rescheduled. The accompanying tests are thorough and validate the new behavior correctly.

One point of concern is the removal of the shutdown timeout in pkg/cmd/run_node.go. While the primary cause of shutdown hangs is addressed, removing the timeout completely could lead to indefinite hangs if other parts of the system fail to shut down gracefully. I've added a comment suggesting to reconsider this change.

@codecov

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 62.50000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.83%. Comparing base (dd87ec9) to head (56bbff5).
⚠️ Report is 1 commits behind head on main.

Files with missing lines           Patch %   Lines
block/internal/syncing/syncer.go   75.00%    4 Missing and 1 partial ⚠️
pkg/cmd/run_node.go                 0.00%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3060      +/-   ##
==========================================
+ Coverage   61.61%   61.83%   +0.21%     
==========================================
  Files         111      111              
  Lines       11120    11123       +3     
==========================================
+ Hits         6852     6878      +26     
+ Misses       3530     3506      -24     
- Partials      738      739       +1     
Flag       Coverage Δ
combined   61.83% <62.50%> (+0.21%) ⬆️



@julienrbrt julienrbrt marked this pull request as ready for review February 10, 2026 08:58
Contributor

@tac0turtle tac0turtle left a comment


utACK

@julienrbrt julienrbrt enabled auto-merge February 10, 2026 11:02
@julienrbrt julienrbrt added this pull request to the merge queue Feb 10, 2026
Merged via the queue into main with commit fa3f0e2 Feb 10, 2026
44 of 47 checks passed
@julienrbrt julienrbrt deleted the julien/fix-shutdown branch February 10, 2026 12:14
